Structured Triplet Learning with POS-tag Guided Attention for Visual Question Answering
Authors
Abstract
Visual question answering (VQA) is of significant interest due to its potential to be a strong test of image understanding systems and to probe the connection between language and vision. Despite much recent progress, general VQA is far from a solved problem. In this paper, we focus on the VQA multiple-choice task, and provide some good practices for designing an effective VQA model that can capture language-vision interactions and perform joint reasoning. We explore mechanisms of incorporating part-of-speech (POS) tag guided attention, convolutional n-grams, triplet attention interactions between the image, question and candidate answer, and structured learning for triplets based on image-question pairs. We evaluate our models on two popular datasets: Visual7W and VQA Real Multiple Choice. Our final model achieves the state-of-the-art performance of 68.2% on Visual7W, and a very competitive performance of 69.6% on the test-standard split of VQA Real Multiple Choice.
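As a rough illustration of the POS-tag guided attention and multiple-choice setup named in the abstract, the sketch below restricts a softmax over question words to those with selected POS tags, then picks the highest-scoring (image, question, answer) triplet. The abstract does not say how the paper implements either step, so the tag set `CONTENT_TAGS`, the function names `pos_guided_attention` and `pick_answer`, and all scores are hypothetical assumptions.

```python
import math

# Assumption: which POS tags count as "content" words is chosen here for
# illustration only; the abstract does not list the tags the paper uses.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "NUM"}


def pos_guided_attention(scores, pos_tags, keep=CONTENT_TAGS):
    """Softmax over word scores, restricted to words whose POS tag is in `keep`.

    Words with any other tag receive weight 0. If no word qualifies, fall
    back to a plain softmax over all words.
    """
    if len(scores) != len(pos_tags):
        raise ValueError("scores and pos_tags must have the same length")
    if not scores:
        return []
    mask = [tag in keep for tag in pos_tags]
    if not any(mask):
        mask = [True] * len(scores)
    # Subtract the largest kept score for numerical stability.
    peak = max(s for s, m in zip(scores, mask) if m)
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def pick_answer(triplet_scores):
    """Return the index of the highest-scoring (image, question, answer) triplet."""
    return max(range(len(triplet_scores)), key=triplet_scores.__getitem__)


# Hypothetical question, tags, and raw word scores.
words = ["what", "color", "is", "the", "umbrella"]
tags = ["PRON", "NOUN", "AUX", "DET", "NOUN"]
weights = pos_guided_attention([0.2, 1.0, 0.1, 0.3, 1.0], tags)

# Hypothetical scores for four candidate answers to one image-question pair.
best = pick_answer([0.1, 0.7, 0.2, 0.05])
```

Here only "color" and "umbrella" receive attention weight, and the second candidate answer is selected. In a trained model the word scores and triplet scores would come from learned networks rather than hand-written numbers.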
Similar resources
Image-Question-Linguistic Co-Attention for Visual Question Answering
Our project focuses on VQA: Visual Question Answering [1], specifically, answering multiple choice questions about a given image. We start by building a Multi-Layer Perceptron (MLP) model with question-grouped training and softmax loss. GloVe embedding and ResNet image features are used. We are able to achieve near state-of-the-art accuracy with this model. Then we add image-question co-attention [...
ABC-CNN: An Attention Based Convolutional Neural Network for Visual Question Answering
We propose a novel attention based deep learning architecture for the visual question answering (VQA) task. Given an image and an image-related question, VQA returns a natural language answer. Since different questions inquire about the attributes of different image regions, generating correct answers requires the model to have question-guided attention, i.e., the attention on the regions correspond...
Segmentation Guided Attention Networks for Visual Question Answering
In this paper we propose to solve the problem of Visual Question Answering by using a novel segmentation guided attention based network which we call SegAttendNet. We use image segmentation maps, generated by a Fully Convolutional Deep Neural Network to refine our attention maps and use these refined attention maps to make the model focus on the relevant parts of the image to answer a question....
Automatic Multi-Layer Corpus Annotation for Evaluating Question Answering Methods: CBC4Kids
Reading comprehension tests are receiving increased attention within the NLP community as a controlled test-bed for developing, evaluating and comparing robust question answering (NLQA) methods. To support this, we have enriched the MITRE CBC4Kids corpus with multiple XML annotation layers recording the output of various tokenizers, lemmatizers, a stemmer, a semantic tagger, POS taggers and syn...
Journal title:
- CoRR
Volume abs/1801.07853
Pages -
Publication date 2018